Exploratory Data Analysis on Brazil E-Commerce Olist Stores Data¶

Olist is the largest department store in Brazilian marketplaces. Olist connects small businesses from all over Brazil to sales channels without hassle and with a single contract. Those merchants are able to sell their products through the Olist Store and ship them directly to customers using Olist's logistics partners.

The dataset contains several .csv files with information on orders, customers, sellers, feedback, products, geographic location, and more. After a customer purchases a product from the Olist Store, a seller gets notified to fulfill that order. Once the customer receives the product, or the estimated delivery date is due, the customer gets a satisfaction survey by email where they can rate the purchase experience and write some comments. The dataset has been downloaded from Kaggle.

What does Exploratory Data Analysis hold for us?¶

Nowadays data acts as a resource, and we often dig deeper into it to extract information. Hence EDA is a kind of resource extraction and value addition on our data. Beyond this, the other objective of EDA is to explore the unknown terrain of the data and gather useful insights from it.

Below is the outline, as well as the sequential steps taken to dig into the dataset:

  • Creating the environment for running code cells on Colab and Jupyter Notebook.
  • Installing the required packages and upgrading them (if required).
  • Downloading the dataset from Kaggle with the help of Kaggle account credentials (username and API token).
  • Following the dataset schema, merging everything into one dataframe, and cleaning it with Pandas.
  • Referring to the individual columns of the final dataframe, framing questions to gather useful insights.
  • Pictorially representing the answers to the questions framed.
  • Summarizing inferences and drawing conclusions.

Installing and importing required packages¶

Among the variety of options for executing the code, from Jupyter Notebook, Kaggle, and Binder to Google Colab, the code can be run with minimal or no changes at just a few clicks of the run button (after making sure all required packages are available).

Since the merged dataset contains over 117K rows and about 40 columns, choosing a platform that provides more RAM for computing is the better option for smoother execution.

Downloading the dataset from Kaggle, as well as committing the notebook, requires API keys, which one can get from their respective account settings.
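Instead of typing the credentials interactively every time, opendatasets can pick them up from a kaggle.json file in the working directory. A minimal sketch (the username and key below are placeholders, not real credentials):

```python
import json

# Placeholder credentials -- replace with the ones from your own Kaggle
# account ("Create New API Token" on the account settings page).
creds = {"username": "your_username", "key": "your_api_key"}

# opendatasets looks for kaggle.json in the working directory and, if found,
# uses it instead of prompting for the username and key interactively.
with open("kaggle.json", "w") as f:
    json.dump(creds, f)
```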

In the process of EDA we will be using multiple libraries from the Python ecosystem, such as numpy, seaborn, pandas, folium, wordcloud, and plotly. So let's use pip to install the required packages.

In [1]:
!pip install plotly opendatasets --quiet
!pip install seaborn --quiet
!pip install folium --quiet
!pip install wordcloud --quiet
!pip install pandas numpy --quiet
!pip install geopy --quiet
!pip install jsonpickle --quiet

After installing the required libraries, we will import them so they can be used for the work ahead.

In [14]:
# used to download a dataset from kaggle
import opendatasets as od
# Data analysis library in python
import pandas as pd
# numerical computing library
import numpy as np
# common uses of relative to find out date and time remaining or to apply any datetime calculation
from dateutil.relativedelta import relativedelta
# Python module named datetime used for working with dates as date objects
from datetime import datetime
# wordcloud tool can be used to generate clouds made up of large pool of words
from wordcloud import WordCloud
# for creating static visualizations in Python
import matplotlib.pyplot as plt
# a high-level interface for drawing informative statistical graphics.
import seaborn as sns
#for creating interactive visualizations
import plotly.express as px
#Plotting geospatial data with Folium
import folium
# Python module for working with JSON data
import json
# jsonpickle is a library for the two-way conversion of complex Python objects and JSON
import jsonpickle
# the json module's encoders and decoders preserve input and output order by default
from json import JSONEncoder
import plotly.data as pdata
# Show graphs inline in the notebook rather than as pop-ups
%matplotlib inline
import matplotlib
from matplotlib.gridspec import GridSpec
from matplotlib import rcParams
# Different module for making the different types of Folium graphs
from folium.plugins import FastMarkerCluster, Fullscreen, MiniMap, HeatMap, HeatMapWithTime, LocateControl

Downloading required dataset:¶

The dataset is downloaded from Kaggle using the Kaggle account username and API key.

In [3]:
url='https://www.kaggle.com/datasets/olistbr/brazilian-ecommerce?select=olist_geolocation_dataset.csv'
od.download(url)

# Give kaggle account credential

Please provide your Kaggle credentials to download this dataset. Learn more: http://bit.ly/kaggle-creds
Your Kaggle username: gautamkumarjha
Your Kaggle Key: ········
Downloading brazilian-ecommerce.zip to ./brazilian-ecommerce
100%|██████████| 42.6M/42.6M [00:01<00:00, 40.9MB/s]

Data Preparation, cleaning dataset:¶

  • Load the individual .csv files using pandas
  • Explore the different columns of each dataframe, checking their datatypes, size, and shape
  • Identify, from the dataset schema, the conditions for merging into the final dataset
  • Make a copy of the final merged dataframe
  • Drop the columns which don't add much value to the analysis
  • Start digging into the dataframe
In [4]:
df1=pd.read_csv('brazilian-ecommerce/olist_order_payments_dataset.csv',index_col= False)

df2=pd.read_csv('brazilian-ecommerce/olist_products_dataset.csv',index_col= False)

df3=pd.read_csv('brazilian-ecommerce/olist_order_reviews_dataset.csv',index_col= False)

df4=pd.read_csv('brazilian-ecommerce/olist_orders_dataset.csv',index_col= False)

df5=pd.read_csv('brazilian-ecommerce/olist_order_items_dataset.csv',index_col= False)
                
df6=pd.read_csv('brazilian-ecommerce/olist_sellers_dataset.csv',index_col= False)
                
df7=pd.read_csv('brazilian-ecommerce/olist_geolocation_dataset.csv',index_col=False)
                
df8=pd.read_csv('brazilian-ecommerce/olist_customers_dataset.csv',index_col= False)


Dataset Schema:

The downloaded dataset contains 9 different csv files. So before performing EDA we should follow the schema of the dataset to merge all the csv files into one single .csv file, which eases the process of performing the different Python operations.

For example, the order_reviews_dataset, order_payments_dataset, order_items_dataset, orders_dataset, etc. are merged on the basis of common columns like order_id, and similarly the other dataframes can be merged by following the schema.
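To make the schema-based merging concrete, here is a minimal sketch on tiny stand-in tables (the values are made up; the real merge on the full tables follows below):

```python
import pandas as pd

# Tiny stand-ins for the Olist tables (values are made up for illustration)
orders = pd.DataFrame({'order_id': ['o1', 'o2'], 'customer_id': ['c1', 'c2']})
payments = pd.DataFrame({'order_id': ['o1', 'o2'], 'payment_value': [102.9, 219.3]})
reviews = pd.DataFrame({'order_id': ['o1', 'o2'], 'review_score': [5, 4]})

# Following the schema: merge table by table on the shared order_id column
merged = orders.merge(payments, on='order_id').merge(reviews, on='order_id')
print(merged.shape)  # (2, 4)
```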

Merging different dataframes:¶

Instead of working with bits and pieces of data, it's much better to work with one dataframe, which can be done by combining all the dataframes into a single merged dataframe. Then we can perform our analysis.

In [5]:
df_merge=df2.merge(df5,on='product_id')
df_merge=df_merge.merge(df6,on='seller_id')
df_merge=df_merge.merge(df1,on='order_id')
df_merge=df_merge.merge(df3,on='order_id')
df_merge=df_merge.merge(df4,on='order_id')
df_merge=df_merge.merge(df8,on='customer_id')
In [6]:
df_merge.to_csv('df_merge_1,2,3,4,5,6,8.csv')
In [7]:
# Reading the final merged dataframe back from the saved csv file
df=pd.read_csv('df_merge_1,2,3,4,5,6,8.csv')

Let's extract a copy of the data from these columns into a new dataframe df_copy, which we can continue to modify without affecting the original dataframe.

In [8]:
df_copy=df.copy()

Data preparation for EDA:¶

Before drawing useful insights from the dataset, we will check different information like the number of columns and rows, data types, etc. By performing these steps we get a fair estimate of the dataset on which we build our EDA.

Let's view the dataset's shape, size, and columns.

In [9]:
n_rows,n_column=df_copy.shape
n_column
Out[9]:
40
In [23]:
df_copy.shape
Out[23]:
(117329, 39)

Since not all columns in the dataset have equal importance for the analysis to be carried out, it's better to drop the ones that aren't needed.

In [9]:
# Re-running only this cell will throw an error because the column has already been deleted
df_copy = df_copy.drop('Unnamed: 0',axis=1)
In [10]:
df_copy.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 117329 entries, 0 to 117328
Data columns (total 39 columns):
 #   Column                         Non-Null Count   Dtype  
---  ------                         --------------   -----  
 0   product_id                     117329 non-null  object 
 1   product_category_name          115634 non-null  object 
 2   product_name_lenght            115634 non-null  float64
 3   product_description_lenght     115634 non-null  float64
 4   product_photos_qty             115634 non-null  float64
 5   product_weight_g               117309 non-null  float64
 6   product_length_cm              117309 non-null  float64
 7   product_height_cm              117309 non-null  float64
 8   product_width_cm               117309 non-null  float64
 9   order_id                       117329 non-null  object 
 10  order_item_id                  117329 non-null  int64  
 11  seller_id                      117329 non-null  object 
 12  shipping_limit_date            117329 non-null  object 
 13  price                          117329 non-null  float64
 14  freight_value                  117329 non-null  float64
 15  seller_zip_code_prefix         117329 non-null  int64  
 16  seller_city                    117329 non-null  object 
 17  seller_state                   117329 non-null  object 
 18  payment_sequential             117329 non-null  int64  
 19  payment_type                   117329 non-null  object 
 20  payment_installments           117329 non-null  int64  
 21  payment_value                  117329 non-null  float64
 22  review_id                      117329 non-null  object 
 23  review_score                   117329 non-null  int64  
 24  review_comment_title           13892 non-null   object 
 25  review_comment_message         49679 non-null   object 
 26  review_creation_date           117329 non-null  object 
 27  review_answer_timestamp        117329 non-null  object 
 28  customer_id                    117329 non-null  object 
 29  order_status                   117329 non-null  object 
 30  order_purchase_timestamp       117329 non-null  object 
 31  order_approved_at              117314 non-null  object 
 32  order_delivered_carrier_date   116094 non-null  object 
 33  order_delivered_customer_date  114858 non-null  object 
 34  order_estimated_delivery_date  117329 non-null  object 
 35  customer_unique_id             117329 non-null  object 
 36  customer_zip_code_prefix       117329 non-null  int64  
 37  customer_city                  117329 non-null  object 
 38  customer_state                 117329 non-null  object 
dtypes: float64(10), int64(6), object(23)
memory usage: 34.9+ MB

The final dataframe contains 10 float, 6 int, and 23 object columns. Note that several columns do contain null values (for example, review_comment_title has only 13,892 non-null entries), which we should keep in mind during the analysis.
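A quick way to confirm which columns hold missing entries is isna().sum(); a sketch on a toy frame mimicking two of the sparse columns above:

```python
import numpy as np
import pandas as pd

# Toy frame mimicking columns that contain missing values in the merged data
df = pd.DataFrame({
    'review_comment_title': ['otimo', np.nan, np.nan],
    'order_approved_at': ['2018-01-13 02:06:32', '2018-06-09 02:57:41', np.nan],
})

# isna().sum() counts the missing entries per column
null_counts = df.isna().sum()
print(null_counts.to_dict())  # {'review_comment_title': 2, 'order_approved_at': 1}
```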

We'll convert the object-type date columns into datetime type using pd.to_datetime.
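The conversion itself is a one-liner; a small sketch on sample timestamp strings:

```python
import pandas as pd

# Object (string) timestamps like those in order_purchase_timestamp
s = pd.Series(['2018-04-24 16:16:53', '2017-01-27 12:10:46'])

# pd.to_datetime turns the strings into a datetime64 column,
# unlocking .dt accessors and timestamp arithmetic
converted = pd.to_datetime(s)
print(converted.dtype)  # datetime64[ns]
```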

Let's use pandas describe method to get statistical calculations of numerical columns.

In [12]:
df_copy.describe().style.background_gradient(cmap='Greens',axis=None)
Out[12]:
  product_name_lenght product_description_lenght product_photos_qty product_weight_g product_length_cm product_height_cm product_width_cm order_item_id price freight_value seller_zip_code_prefix payment_sequential payment_installments payment_value review_score customer_zip_code_prefix
count 115634.000000 115634.000000 115634.000000 117309.000000 117309.000000 117309.000000 117309.000000 117329.000000 117329.000000 117329.000000 117329.000000 117329.000000 117329.000000 117329.000000 117329.000000 117329.000000
mean 48.768018 785.802861 2.205528 2110.763062 30.254456 16.612476 23.071452 1.194121 120.524349 20.027514 24450.781955 1.094452 2.940151 172.062565 4.031467 35060.118112
std 10.033831 652.382965 1.717783 3785.128931 16.177519 13.452625 11.745875 0.684225 182.944843 15.828077 27582.364358 0.731174 2.775370 265.388194 1.387927 29849.496175
min 5.000000 4.000000 1.000000 0.000000 7.000000 2.000000 6.000000 1.000000 0.850000 0.000000 1001.000000 1.000000 0.000000 0.000000 1.000000 1003.000000
25% 42.000000 346.000000 1.000000 300.000000 18.000000 8.000000 15.000000 1.000000 39.900000 13.080000 6429.000000 1.000000 1.000000 60.750000 4.000000 11250.000000
50% 52.000000 600.000000 1.000000 700.000000 25.000000 13.000000 20.000000 1.000000 74.900000 16.280000 13660.000000 1.000000 2.000000 108.100000 5.000000 24240.000000
75% 57.000000 983.000000 3.000000 1800.000000 38.000000 20.000000 30.000000 1.000000 134.900000 21.180000 28035.000000 1.000000 4.000000 189.060000 5.000000 58770.000000
max 76.000000 3992.000000 20.000000 40425.000000 105.000000 105.000000 118.000000 21.000000 6735.000000 409.680000 99730.000000 29.000000 24.000000 13664.080000 5.000000 99990.000000
In [13]:
df_copy.tail().style.background_gradient(cmap='Greens',axis=0)
Out[13]:
  product_id product_category_name product_name_lenght product_description_lenght product_photos_qty product_weight_g product_length_cm product_height_cm product_width_cm order_id order_item_id seller_id shipping_limit_date price freight_value seller_zip_code_prefix seller_city seller_state payment_sequential payment_type payment_installments payment_value review_id review_score review_comment_title review_comment_message review_creation_date review_answer_timestamp customer_id order_status order_purchase_timestamp order_approved_at order_delivered_carrier_date order_delivered_customer_date order_estimated_delivery_date customer_unique_id customer_zip_code_prefix customer_city customer_state
117324 daf9915404caf75fa1717a436b46fa14 moveis_decoracao 63.000000 379.000000 5.000000 1400.000000 33.000000 15.000000 30.000000 5d36e8ca3ccd1262b7641467fdba0b8e 1 0f519b0d2e5eb2227c93dd25038bfc01 2017-01-31 12:10:46 79.900000 23.000000 14940 ibitinga SP 1 credit_card 1 102.900000 e587cf1954fc358343a8c510c8694269 5 nan Muito obrigada...vcs foram pontuais e uma loja com grande responsabilidade... 2017-02-10 00:00:00 2017-02-11 13:30:56 3b2b96751131c948164edc5d21d27abb delivered 2017-01-27 12:10:46 2017-01-27 12:31:29 2017-01-30 10:26:15 2017-02-08 18:42:20 2017-03-20 00:00:00 55b3a0a9998356016bbb072bf8c188a6 60830 fortaleza CE
117325 b0498e44190727b728ae4490f2e9b6a5 malas_acessorios 25.000000 712.000000 4.000000 5250.000000 34.000000 23.000000 40.000000 8edaa376e19d08bc84ab8845682216b4 1 dda37071807e404c5bb2a1590c66326f 2018-04-30 09:30:37 199.990000 19.280000 3282 sao paulo SP 1 credit_card 1 219.270000 7bbe7636141510158aa577dcfd017b13 5 nan nan 2018-05-11 00:00:00 2018-05-13 00:12:43 4c1e29ec2ed2feac441cf24b25262ed2 delivered 2018-04-24 08:37:20 2018-04-24 17:26:25 2018-04-25 12:24:00 2018-05-10 22:36:37 2018-05-14 00:00:00 913a4e0cb7fe555e6cffa875ecd58e2e 28455 sao jose de uba RJ
117326 80b391b2dc6c958ef3ad34fa7ee01423 automotivo 17.000000 306.000000 4.000000 200.000000 26.000000 4.000000 13.000000 635c8d3ab47b7448b5d6c81dc9b7526e 1 3aa3c89ae3cd482385568be76120f63c 2018-06-14 02:57:41 29.900000 19.040000 74343 goiania GO 1 boleto 1 48.940000 0a954efcadecd2bfa871c46f5fd05308 4 nan muito bom 2018-06-19 00:00:00 2018-06-20 01:29:30 d7b2339f118be9c6d98389f6d0c370d4 delivered 2018-06-07 22:36:29 2018-06-09 02:57:41 2018-06-14 13:59:00 2018-06-18 18:33:50 2018-06-26 00:00:00 0c2bb0bab2bf22858fa762d320a326bb 3729 sao paulo SP
117327 2b6535d32c6996c9478c131a8ff17a05 construcao_ferramentas_seguranca 50.000000 428.000000 2.000000 333.000000 16.000000 9.000000 10.000000 b3e5d245a95dd6378330cc2249d9d6b4 1 e64882b4ef12aee8d0faca4db5b681f4 2018-01-18 02:06:32 115.000000 21.610000 74210 goiania GO 1 boleto 1 273.220000 9802486c19789491798384927fd28c30 4 nan nan 2018-02-07 00:00:00 2018-02-07 20:09:14 e08e569cab13d0c5d8b0899056edf5af delivered 2018-01-12 15:28:49 2018-01-13 02:06:32 2018-01-16 18:19:06 2018-02-06 15:41:31 2018-02-08 00:00:00 3ac652bdb8a0de18c0a73ce11f88efed 88075 florianopolis SC
117328 2b6535d32c6996c9478c131a8ff17a05 construcao_ferramentas_seguranca 50.000000 428.000000 2.000000 333.000000 16.000000 9.000000 10.000000 b3e5d245a95dd6378330cc2249d9d6b4 2 e64882b4ef12aee8d0faca4db5b681f4 2018-01-18 02:06:32 115.000000 21.610000 74210 goiania GO 1 boleto 1 273.220000 9802486c19789491798384927fd28c30 4 nan nan 2018-02-07 00:00:00 2018-02-07 20:09:14 e08e569cab13d0c5d8b0899056edf5af delivered 2018-01-12 15:28:49 2018-01-13 02:06:32 2018-01-16 18:19:06 2018-02-06 15:41:31 2018-02-08 00:00:00 3ac652bdb8a0de18c0a73ce11f88efed 88075 florianopolis SC

Open-ended questions for exploratory data analysis and visualization:¶

By looking at the individual columns we can frame open-ended questions, which prove to be the backbone of EDA, and we will try to answer those questions either with numeric summaries or with visualizations.

Let's set the chart style, font size, figure size, and figure facecolour using matplotlib. These settings apply only to matplotlib and seaborn charts.

In [15]:
sns.set_style('darkgrid')
matplotlib.rcParams['font.size'] = 14
matplotlib.rcParams['figure.figsize'] = (9, 5)
matplotlib.rcParams['figure.facecolor'] = '#00000000'

Price¶

Among the different columns of the dataset, the price column gives an estimate of the money spent by customers on their purchases. So let's explore this column:

In [16]:
print('Min money spent by a customer is {} R$'.format(df_copy.price.min()))
print('\nHighest money spent by a customer is {} R$'.format(df_copy.price.max()))
print('\nAvg money spent by a customer is {} R$'.format(df_copy.price.mean()))
Min money spent by a customer is 0.85 R$

Highest money spent by a customer is 6735.0 R$

Avg money spent by a customer is 120.52434879709405 R$
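The three statistics can also be pulled in one call with .agg; a sketch on a toy price series (the values are illustrative, not from the dataset):

```python
import pandas as pd

# Toy stand-in for df_copy.price (values are illustrative)
price = pd.Series([0.85, 39.90, 74.90, 134.90, 6735.00])

# One .agg call computes several summary statistics at once
summary = price.agg(['min', 'max', 'mean'])
print(summary['min'], summary['max'])  # 0.85 6735.0
```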

Mode of Payment¶

Exploring this column of the dataframe tells us which mode of payment is most favoured by customers, followed by the other modes. So let's explore this column.

In [14]:
# Checking the different values contained in the column "payment_type"
df_copy.payment_type.unique()
Out[14]:
array(['debit_card', 'credit_card', 'boleto', 'voucher'], dtype=object)
In [15]:
# Getting the count for each payment category

df_copy.payment_type.value_counts()
Out[15]:
credit_card    86520
boleto         22835
voucher         6282
debit_card      1692
Name: payment_type, dtype: int64

Now we want to separate the labels and the counts, which is useful for making the pie chart. So let's convert the result from a Series to a dictionary, so that we can access keys and values separately.

In [37]:
payment_dict=df_copy.payment_type.value_counts().to_dict()
payment_dict
Out[37]:
{'credit_card': 86520, 'boleto': 22835, 'voucher': 6282, 'debit_card': 1692}
In [80]:
list(payment_dict.values())
Out[80]:
[86520, 22835, 6282, 1692]
In [81]:
list(payment_dict.keys())
Out[81]:
['credit_card', 'boleto', 'voucher', 'debit_card']

With the help of the above information we can make a pie chart.

In [38]:
fig = px.pie(df_copy.payment_type.value_counts(),
             values=list(payment_dict.values()),
             names=list(payment_dict.keys()),
             color=list(payment_dict.keys()),
             # keys must match the (lowercase) category names for the map to apply
             color_discrete_map={'credit_card': 'lightcyan',
                                 'debit_card': 'cyan',
                                 'voucher': 'royalblue',
                                 'boleto': 'darkblue'},
             hole=.4,
             title='<b>Mode of payment used by customers at Olist stores</b>',
             template="none")
fig.show()

The inferences which I could draw from this pie chart are as follows:

  • 73.7% of people pay via credit card when purchasing goods at Olist stores.
  • 19.4% of people pay via boleto at Olist stores.
  • 5.46% of people pay via the voucher system at Olist stores.
  • The remaining 1.4% of customers pay for their purchases at Olist stores via debit card.

This reflects that the majority of customers like to pay via credit card, whereas a very small share pay via debit card, which implies the stores need to keep card payment working reliably; any disruption could hamper the revenue generation of the stores as well as inconvenience customers. If the stores want to receive more payments via debit card, they would need offers, coupons, or some other kind of incentive to encourage that mode of payment.

Customer review score out of 5 points¶

For the overall experience, in exchange for the time spent and the goods and services purchased from the Olist stores, the customer gives a review score out of 5. So let's dig in to see what percentage of customers give review scores of 5, 4, 3, 2, and 1.

In [22]:
x_1,y_1=df_copy[df_copy.review_score== 1].shape
x_2,y_2=df_copy[df_copy.review_score== 2].shape
x_3,y_3=df_copy[df_copy.review_score== 3].shape
x_4,y_4=df_copy[df_copy.review_score== 4].shape
x_5,y_5=df_copy[df_copy.review_score== 5].shape
print('{}% of Product got 1 score rating'.format(x_1*100/(x_1+x_2+x_3+x_4+x_5)))
print('\n{}% of Product got 2 score rating'.format(x_2*100/(x_1+x_2+x_3+x_4+x_5)))
print('\n{}% of Product got 3 score rating'.format(x_3*100/(x_1+x_2+x_3+x_4+x_5)))
print('\n{}% of Product got 4 score rating'.format(x_4*100/(x_1+x_2+x_3+x_4+x_5)))
print('\n{}% of Product got 5 score rating'.format(x_5*100/(x_1+x_2+x_3+x_4+x_5)))
12.773205331300547% of Product got 1 score rating

3.456531319604083% of Product got 2 score rating

8.406684374143907% of Product got 3 score rating

19.118119989994877% of Product got 4 score rating

56.245458984956585% of Product got 5 score rating

From this we can infer that, out of all review scores, a score of 5 is given more often than any other.

  • Out of all review scores given by customers, 56.245% are a 5 score, the maximum of its kind.
  • 19.11% of the scores are a 4, the 2nd highest share.
  • Whereas 12.77% of the scores are a 1, followed by 3.45% for a 2 and 8.40% for a 3.

So among all the customers who have spent their money purchasing goods and services at Olist stores, 56.2% are very happy with the kind of service they got, whereas a considerable 12.77% are not happy with the service they received.
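The per-score shares above can also be computed in a single step with value_counts(normalize=True); a sketch on toy scores (not the real distribution):

```python
import pandas as pd

# Toy stand-in for df_copy.review_score
scores = pd.Series([5, 5, 5, 4, 4, 3, 1, 5, 2, 5])

# normalize=True returns fractions directly; multiply by 100 for percentages
pct = (scores.value_counts(normalize=True) * 100).round(1)
print(pct[5], pct[4])  # 50.0 20.0
```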

Number of installments for payment of their charges¶

From the mode-of-payment column we came to know that most customers pay via credit card. For the total amount owed at Olist stores, instead of paying one bulk amount, customers can pay smaller amounts periodically over a period of time; based on their financial liability they opt for the required number of installments to clear their dues. Hence it could be informative to explore the column "payment_installments".

In [39]:
fig = px.histogram(df_copy, x="payment_installments",nbins=18,
                  title='<b>Number of installments for repaying bills</b>')
fig.update_traces(xbins=dict( # bins used for histogram
        start=1,
        end=25.0,
        size=2
    ))
fig.show()

The inferences we can draw from the above plot are below:

  • 51.646K people repay their dues in 2 or fewer installments
  • 14.092K people repay their dues in 3-4 installments
  • 7457 customers repay their dues in 5-6 installments
  • 4872 customers repay their dues in 7-8 installments
  • 5490 customers repay their dues in 9-10 installments

    This reflects that a major chunk of customers clear their dues by paying very few installments, which reflects the good financial position they are in. In comparison, very few (5490) customers repay their dues in 9-10 installments, which takes almost a year to clear, possibly because of different financial constraints. For the amounts on which customers take credit, the financial institutions are able to recover their charges within a maximum of about a year.

Review Comments¶

After getting the score for the product purchased, it's necessary to check what kinds of review messages or comments are given by consumers. So instead of looking at the individual words of the feedback messages, let's create a word cloud, a visual representation of words, to see the most frequent terms in the comments given by customers.

So let's create the cloud of words from the respective columns and see the feedback messages in a pictorial way.

In [85]:
words = ' '.join((i for i in df_copy.review_comment_title.dropna().str.replace(';', ' ').str.replace(',', ' ')))
wc = WordCloud( width=800, height=400, background_color='black',contour_color='#023075',contour_width=3,colormap='rainbow').generate(words)

plt.figure(figsize=(16,8))
plt.title('Customers feedbacks')
plt.axis("off")
plt.grid(False)
plt.imshow(wc);


words = ' '.join((i for i in df_copy.review_comment_message.dropna().str.replace(';', ' ').str.replace(',', ' ')))
wc = WordCloud( width=800, height=400, background_color='black',contour_color='#023075',contour_width=3,colormap='rainbow').generate(words)

plt.figure(figsize=(16,8))
plt.axis("off")
plt.grid(False)
plt.imshow(wc);

The words that appear in the word cloud are not in English (the reviews are written in Portuguese), so some work might be required to translate them into plain English.

Among the words that show up are super, Excelente, Recomendo, bom, and muito; all of these represent satisfied customers who recommend purchasing goods from Olist stores to others. This aligns well with the 5 review score, which constitutes about 56% of reviews.

Whereas there exist a few words that indicate a bad experience for some customers; on identification, those customers can be reached out to and their complaints addressed to look for remedies.

Order and confirmation time lag¶

It is often experienced that there exists a time lag between when we place our order and the acceptance of that order for further processing (packaging, shipping, logistics assignment, etc.). If the time lag is nominal, the user experience of purchasing from the Olist stores doesn't get hampered much; the larger the time lag, the longer the delay before the product finally reaches its customer, resulting in more waiting time.

Hence, just to make sure Olist customers remain untouched by this problem, it's recommended to keep an eye on the factors involving time lag.

In [16]:
# Checking the values present inside the required column of the dataframe before performing datetime arithmetic
df_copy.order_purchase_timestamp
Out[16]:
0         2018-04-24 16:16:53
1         2018-07-04 18:39:28
2         2018-04-23 19:03:35
3         2018-04-23 19:03:35
4         2018-07-29 20:50:22
                 ...         
117324    2017-01-27 12:10:46
117325    2018-04-24 08:37:20
117326    2018-06-07 22:36:29
117327    2018-01-12 15:28:49
117328    2018-01-12 15:28:49
Name: order_purchase_timestamp, Length: 117329, dtype: object

For any kind of arithmetic we intend to do on columns containing date and time information, we first convert the required columns to timestamps.

In [24]:
# Creating extra columns in the dataframe holding the particular columns converted to datetime format
df_copy['customer_ordered_datetime']=pd.to_datetime(df_copy.order_purchase_timestamp)
In [25]:
df_copy['order_approved_time']=pd.to_datetime(df_copy.order_approved_at)
df_copy['order_approved_time'].dt.time
Out[25]:
0         19:04:19
1         16:33:14
2         19:17:45
3         19:17:45
4         18:31:13
            ...   
117324    12:31:29
117325    17:26:25
117326    02:57:41
117327    02:06:32
117328    02:06:32
Name: order_approved_time, Length: 117329, dtype: object
In [26]:
df_copy['order_day'] = df_copy['customer_ordered_datetime'].dt.day_of_week
df_copy['order_month'] = df_copy['customer_ordered_datetime'].dt.month_name()
df_copy['order_year'] = df_copy['customer_ordered_datetime'].dt.year  
df_copy['order_time'] = df_copy['customer_ordered_datetime'].dt.time 
df_copy['Order_month_year'] =df_copy['customer_ordered_datetime'].dt.to_period('M')

After converting all of the required columns into timestamps, we take the difference of two timestamps and then convert the result into hours.
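The lag computation can be sketched as follows on two toy orders (the timestamps are made up; on the real frame the same arithmetic runs on the purchase and approval columns):

```python
import pandas as pd

# Toy order data standing in for df_copy (timestamps are made up)
df = pd.DataFrame({
    'order_purchase_timestamp': ['2018-04-24 16:16:53', '2018-07-04 18:39:28'],
    'order_approved_at':        ['2018-04-24 19:04:19', '2018-07-05 06:39:28'],
})

purchase = pd.to_datetime(df['order_purchase_timestamp'])
approved = pd.to_datetime(df['order_approved_at'])

# Subtracting two datetime columns yields Timedeltas; express them in hours
lag_hours = (approved - purchase).dt.total_seconds() / 3600
print(lag_hours.round(2).tolist())  # [2.79, 12.0]
```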

The inferences we can make from these plots are below:

  • Less than 0.01% of customers face a high time lag; a minimal time lag is very much expected, as matching the inventory, receiving the required information, and passing it from warehouse to logistics partner takes some minimal time.
  • The minimum time lag that exists is 0 hrs, which shows really quick decision making and processing time by the respective departments.
  • On average, customers face a 10.46 hr time lag, which might be because of difficulty in assigning the logistics partner, proper packaging, availability of the product, etc.
  • On the other hand, a very negligible proportion of customers face a time lag of 2 months, which could have resulted from non-availability of products or geographical challenges.

    The general trend of the graph tells the story that the majority of customers get immediate confirmation after placing their order, which indicates good coordination between the different departments of Olist. The handful of delays could be due to anything from lack of product to not being able to find the respective logistics partner.

Order status¶

After getting an estimate of the time lag, it's customary to know whether the product ordered by the customer finally reached its destination or not. After all, with minimal hurdles and delay, the product should reach its customer.

So let's explore the required column which gives this information.

In [40]:
orderStatus_dict=df_copy.order_status.value_counts().to_dict()

fig = px.pie(df_copy.order_status.value_counts(),
             values=list(orderStatus_dict.values()),
             names=list(orderStatus_dict.keys()),
             color=list(orderStatus_dict.keys()),
             title='<b>Customers order delivery status</b>',
             color_discrete_map={'delivered': 'darkcyan',
                                 'shipped': 'cyan',
                                 'canceled': 'royalblue',
                                 'invoiced': 'darkblue',
                                 'processing': 'green',
                                 'unavailable': 'yellow',
                                 'approved': 'red'})
fig.show()

The figure above is fairly self-explanatory in terms of the numbers and information it conveys.

  • 97% of products get delivered to their customers, which is as expected: customers who pay for their products should receive them.
  • A minimal portion, 0.497 %, of products get canceled; the reason behind this is a matter for investigation.
  • Apart from these, before a product reaches its final destination it passes through different stages like invoiced, unavailable, shipped, processing, etc. The contribution of each of these categories is less than 0.5 %.

    So we can say that the overall management of product delivery is good, whereas the ~3% of products that are not getting delivered need a closer look to investigate the reason behind the delay.
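The percentage shares quoted above fall straight out of `value_counts(normalize=True)`; a small sketch with made-up counts (not the real Olist numbers):

```python
import pandas as pd

# Hypothetical status labels standing in for df_copy.order_status
status = pd.Series(
    ["delivered"] * 970 + ["shipped"] * 11 + ["canceled"] * 5
    + ["invoiced"] * 4 + ["processing"] * 4 + ["unavailable"] * 4
    + ["approved"] * 2
)

# normalize=True turns counts into proportions; multiply by 100 for percentages
share_pct = status.value_counts(normalize=True) * 100
print(share_pct.round(2))
```

This is often more direct than reading shares off the pie chart when the exact figures are needed in prose.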

Sellers City¶

As different products are supplied by different sellers from different geographical locations, it is important to know the seller counts per city. This can help in maintaining stock across locations, which can reduce transportation time and cost for customers as well as for the company.

So let's dive into this column.

In [41]:
sellers_dict=df_copy.seller_city.value_counts().head(30).to_dict()
fig = px.bar(df_copy.seller_city.value_counts(), x=list(sellers_dict.keys()), y=list(sellers_dict.values()),
            color=list(sellers_dict.keys()),
            title='Top 30 cities of Brazil having the most sellers:')
fig.show()

Category of product¶

After getting several details including payment methods, feedback, and seller location, it is important to investigate the different categories of products sold by the Olist e-commerce stores. Doing this gives a fair estimate of customers' choices and preferences when ordering products.

Traversing the column names of the data frame, the column named "product_category_name" contains the different product categories. Beyond just listing the category names, our attempt should also be to explore, for example, which categories are ordered most when the payment method is credit card. So let's give this column a visual representation and jump straight into its contents.

Top 15 most purchased product categories¶

In [96]:
df_copy.product_category_name.value_counts().head(15)
Out[96]:
cama_mesa_banho           11847
beleza_saude               9944
esporte_lazer              8942
moveis_decoracao           8743
informatica_acessorios     8105
utilidades_domesticas      7331
relogios_presentes         6161
telefonia                  4692
ferramentas_jardim         4558
automotivo                 4356
brinquedos                 4246
cool_stuff                 3964
perfumaria                 3575
bebes                      3178
eletronicos                2827
Name: product_category_name, dtype: int64
In [44]:
%%time
# Check execution time of code-cell
product_cat_dict=df_copy.product_category_name.value_counts().nlargest(30).to_dict()
fig = px.bar(df_copy.product_category_name.value_counts(), 
             x=list(product_cat_dict.values()),
             y=list(product_cat_dict.keys()),
             color=list(product_cat_dict.keys()),
             title="<b>Top 30 product categories purchased by customers</b>",
             orientation='h',
             )
fig.update_layout(xaxis_title='Count of Product category', 
                  yaxis_title='Product category')

fig.show()
CPU times: user 55.2 ms, sys: 1.99 ms, total: 57.2 ms
Wall time: 53.7 ms
In [45]:
# Segregating the required columns of dataframe 
df_product_photos_qty=df_copy.groupby('product_category_name').product_photos_qty
In [138]:
df_product_photos_qty.value_counts()
Out[138]:
product_category_name      product_photos_qty
agro_industria_e_comercio  1.0                   91
                           2.0                   50
                           3.0                   23
                           6.0                    9
                           8.0                    9
                                                 ..
utilidades_domesticas      8.0                   39
                           10.0                  24
                           9.0                    9
                           12.0                   2
                           13.0                   1
Name: product_photos_qty, Length: 520, dtype: int64
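The long `value_counts` listing above can be condensed into one summary number per category; a sketch on toy data (the real call would run on `df_copy` instead of this invented frame):

```python
import pandas as pd

# Toy stand-in for product_category_name / product_photos_qty
df = pd.DataFrame({
    "product_category_name": ["perfumaria", "perfumaria", "brinquedos",
                              "brinquedos", "brinquedos"],
    "product_photos_qty": [1.0, 3.0, 2.0, 4.0, 5.0],
})

# Median number of photos per listing in each category, highest first
median_photos = (df.groupby("product_category_name")["product_photos_qty"]
                   .median()
                   .sort_values(ascending=False))
print(median_photos)
```

A median per category is easier to compare across categories than the raw 520-row counts table.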

Volume of different categories of products: product_length_cm, product_height_cm, product_width_cm¶

After getting the product-related details like category, seller, and zip code, it is necessary to know the volume of the products, a crucial property for sending a product via a logistics partner. Logistics partners are assigned, and transportation cost is calculated, based on product volume. Hence this is crucial information as far as products and their dimensions are concerned.

The dimensions of a product are given by its length, breadth, and height, so we can multiply these columns together and then group the products by category, which gives an estimate of the product volume in each category.

In [27]:
df_copy['Product_volume']=(df_copy.product_length_cm * df_copy.product_height_cm * df_copy.product_width_cm)/1000
df_copy['Product_volume'].sample()
Out[27]:
98541    6.4
Name: Product_volume, dtype: float64
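The per-category grouping promised above can be sketched as follows; the rows here are dummy data mimicking `df_copy` after the `Product_volume` computation:

```python
import pandas as pd

# Dummy rows mimicking df_copy's dimension columns (invented values)
df = pd.DataFrame({
    "product_category_name": ["moveis_decoracao", "moveis_decoracao",
                              "perfumaria", "perfumaria"],
    "product_length_cm": [50.0, 60.0, 16.0, 20.0],
    "product_height_cm": [10.0, 40.0, 10.0, 25.0],
    "product_width_cm":  [25.0, 30.0, 14.0, 20.0],
})
df["Product_volume"] = (df.product_length_cm * df.product_height_cm
                        * df.product_width_cm) / 1000  # cm^3 -> litres

# Average volume per category, largest first
mean_vol = (df.groupby("product_category_name")["Product_volume"]
              .mean()
              .sort_values(ascending=False))
print(mean_vol.round(2))
```

Dividing by 1000 converts cubic centimetres to litres, which keeps the numbers readable.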

Interesting questions and their answers¶

Having performed EDA in depth on every column of the dataset, let's use the analysis above to answer some pointed questions:

  1. Which are the top 20 categories of product customers bought by paying through credit card?
  2. What are the favourite payment methods of customers across the states of Brazil at the Olist stores?
  3. Which categories of product have the highest volume (dimensions), or volume greater than the average volume of products?
  4. What are the order counts per month of the year for the given customer orders?
  5. Which state of Brazil has the maximum number of sellers?
  6. How are customers distributed across the different geographical locations of Brazil?
  7. From which region of Brazil does Olist receive the maximum orders?
  8. In which regions of Brazil are customer orders not getting delivered?
  9. Which region's people mostly use credit card as their payment method?

Q1. Which are the top 20 categories of product customers bought by paying through credit card?¶

The answer to this question revolves around relating one column, product_category_name, to another, payment_type: combining the information drawn from both columns gives the desired answer.

In [16]:
%%time
# Filtering the mode of payment as "credit_card"
df_copy_credit_card=df_copy[df_copy.payment_type=='credit_card']
# Top 20 product categories within the credit-card orders
line_colors = ["#0a2e36", "#27FB6B", "#14cc60", "#036d19", "#09a129"]

credit_card_Payment_product_cat=df_copy_credit_card.product_category_name.value_counts().nlargest(20).to_dict()

fig = px.treemap(df_copy_credit_card.product_category_name.value_counts().nlargest(20),
                 path=[list(credit_card_Payment_product_cat.keys())],
                 values=df_copy_credit_card.product_category_name.value_counts().nlargest(20),
                 width=1000, height=850,
                 title="<b>Category of product purchased by credit card </b>",
                 color='product_category_name',
                 color_discrete_sequence=line_colors,
                 template="none",
                 )
fig.show()
CPU times: user 221 ms, sys: 29.3 ms, total: 250 ms
Wall time: 245 ms

Let's interpret this treemap: the plot lists the top 20 products purchased via credit card, or put another way, it shows which categories of product the most popular payment method, credit card, was used to buy.

Each box in the graph represents one product category, hence 20 categories give 20 rectangular boxes, and the area of a box represents the count for that category. Whichever box (product category) has the greater length and breadth has the larger count, and likewise the smaller-looking boxes correspond to smaller counts. Following this interpretation, the category cama_mesa_banho has the largest count, followed by beleza_saude and esporte_lazer.

Q2. Which categories of product have the highest volume (dimensions), or volume greater than the average volume of products?¶

Getting this information can help in categorizing products, which can be used to assign different logistics partners for smoother delivery, since those products may require bigger space.

The answer to this question can be obtained by combining the information of two columns, product_category_name & Product_volume, of the data frame df_copy.

First we will segregate the volume information from the data frame, and then we will find the corresponding product categories.

In [28]:
df_great_vol=df_copy[df_copy['Product_volume']> df_copy['Product_volume'].mean()]
In [22]:
# Droping the na values from particular columns of the dataframes
df_great_vol=df_great_vol.dropna(axis=0,subset=['product_category_name'])
In [23]:
df_great_vol=df_great_vol.dropna(axis=0,subset=['order_status'])
In [24]:
# Rechecking there exist any na values
df_great_vol['product_category_name'].isna() .sum()
Out[24]:
0
In [25]:
# Rechecking there exist any na values
df_great_vol['order_status'].isna() .sum()
Out[25]:
0
In [29]:
great_vol_dict=df_great_vol.product_category_name.value_counts().nlargest(15).to_dict()

fig = px.pie(df_great_vol.product_category_name.value_counts().nlargest(15)
, values=list(great_vol_dict.values()), names= list(great_vol_dict.keys()), color= list(great_vol_dict.keys()),
             title="<b>Category of product whose volume greater than mean volume </b>",
             hole=0.4
             
            
            )
fig.update_traces(textposition='inside')
fig.show() 

Looking at this graph, the circle is divided into different slices. Each slice represents one product category, and the wider the slice, the larger the proportion of that category. The color of each slice identifies its category, and the numeric shares are shown on the labels. Combining all this information, we can infer the following.

Hence we can say that the categories cama_mesa_banho, utilidades_domesticas, and moveis_decoracao, followed by casa_conforto, have the largest counts, which signals that larger transport vehicles are needed for delivery of these product categories.

Q3. How does the number of orders received by the Olist e-commerce stores vary across the days and months of the timestamp?¶

Answering this question involves exploring columns like shipping_limit_date, price, freight_value, order_status, order_purchase_timestamp, order_approved_at, order_delivered_carrier_date, order_delivered_customer_date, and order_estimated_delivery_date. Several pieces of information like day, month, year, and timestamp have to be extracted at a finer level.

In [166]:
%%time
sns.set_style("darkgrid")
sns.set(rc={"figure.figsize":(15, 8)})
sns.barplot(data=df_copy, x='order_month', y='payment_value',hue='order_year')
plt.xlabel("Ordered Month" , size = 12)
plt.ylabel("Payment_Amount" , size = 12)
plt.title(" Order_month  vs Customers payment_value for different years" , size = 24)
CPU times: user 1.2 s, sys: 0 ns, total: 1.2 s
Wall time: 1.2 s
Out[166]:
Text(0.5, 1.0, ' Order_month  vs Customers payment_value for different years')

Inferences drawn from this plot can be explained as follows for better understanding.

We can say that in 2018 the Olist stores generated steady revenue throughout the year, whereas in 2016 revenue generation was varying in nature and not as steady as 2018. The year 2017 followed a steady pattern similar to 2018.

Q4. What are the order counts per month of the year for the given customer orders?¶

Knowing which months of the year have been beneficial for the Olist stores, in terms of the maximum number of orders received, plays a crucial role in understanding the financial year.

So we will extract the month and year in concatenated form and take its counts as a measure of the number of orders received at the e-commerce stores.

In [46]:
df_copy['Order_month_year']=df_copy['customer_ordered_datetime'].dt.year.astype(str) + "-" + df_copy['customer_ordered_datetime'].dt.month.astype(str)
# Build the counts dict only after the column exists
order_month_dict=df_copy['Order_month_year'].value_counts().to_dict()
fig = px.bar(df_copy['Order_month_year'].value_counts(), 
              y=list(order_month_dict.values())            
             )
fig.update_layout(
    title="<b>Customer order month of particular year wise</b>",
    xaxis_title="Month of year",
    yaxis_title="No.of orders",
    font=dict(family='Arial',
             size=14,
             )
)
fig.show()

The story narrated by this bar graph: for the Olist e-commerce stores, October 2016 received by far the fewest orders from customers, whereas Oct-Nov 2017 received the maximum, with counts around 9K. The other months of the subsequent years received nearly the same number of orders, with minimal differences between years; i.e., across 2018 the Olist stores received roughly the same number of orders each month.
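As an aside, a year-month key like the one concatenated above can also be built with pandas periods, which sort chronologically for free; a small sketch with invented order dates:

```python
import pandas as pd

# Invented order timestamps standing in for customer_ordered_datetime
ts = pd.to_datetime(pd.Series([
    "2016-10-05 09:12:00", "2017-11-24 20:01:00", "2017-11-02 08:30:00",
]))

# to_period("M") yields a proper year-month key instead of ad-hoc strings
order_month_year = ts.dt.to_period("M")
counts = order_month_year.value_counts().sort_index()
print(counts)
```

Unlike the `"2018-1"`, `"2018-10"` string keys, `Period` values order correctly on the x-axis without extra sorting logic.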

Q5. How do client orders come from different Brazilian regions?¶

As the Olist stores are spread across different geographical locations of Brazil, different locations may not contribute the same number of customer orders to the e-commerce stores. So the number of orders Olist receives from different geo-locations is analyzed in these plots.

But direct mapping from an address or pin-code may not immediately yield a geographical region. So the first attempt should be to determine the latitude and longitude corresponding to the different geo-locations of Brazil.

We pass a URL to Python's requests module and parse the JSON response, which gives the names of the regions.

In [48]:
import json
import requests

r = requests.get('https://servicodados.ibge.gov.br/api/v1/localidades/mesorregioes')
content = [c['UF'] for c in json.loads(r.text)]
br_info = pd.DataFrame(content)
br_info['nome_regiao'] = br_info['regiao'].apply(lambda x: x['nome'])
br_info.drop('regiao', axis=1, inplace=True)
br_info.drop_duplicates(inplace=True)

Next we filter by latitude & longitude to keep only points inside Brazil's geographical bounding box.

In [49]:
#Brazils most Northern spot is at 5 deg 16′ 27.8″ N latitude.;
olist_geolocation=df7
geo_prep = olist_geolocation[olist_geolocation.geolocation_lat <= 5.27438888]
#it’s most Western spot is at 73 deg, 58′ 58.19″W Long.
geo_prep = geo_prep[geo_prep.geolocation_lng >= -73.98283055]
#It’s most southern spot is at 33 deg, 45′ 04.21″ S Latitude.
geo_prep = geo_prep[geo_prep.geolocation_lat >= -33.75116944]
#It’s most Eastern spot is 34 deg, 47′ 35.33″ W Long.
geo_prep = geo_prep[geo_prep.geolocation_lng <=  -34.79314722]
geo_group = geo_prep.groupby(by='geolocation_zip_code_prefix', as_index=False).min()

After successfully filtering the geographical locations, we grab the necessary columns of the data frame and merge in br_info & geo_group, keyed on state and zip-code prefix respectively, to attach region names, latitude, and longitude.

In [50]:
# Merging all the informations
df_orders_items = df_copy.merge(br_info, how='left', left_on='customer_state', right_on='sigla')
df_orders_items = df_orders_items.merge(geo_group, how='left', left_on='customer_zip_code_prefix', 
                                        right_on='geolocation_zip_code_prefix')
df_orders_items.head().style.background_gradient(cmap='ocean',axis=None)
Out[50]:
  product_id product_category_name product_name_lenght product_description_lenght product_photos_qty product_weight_g product_length_cm product_height_cm product_width_cm order_id order_item_id seller_id shipping_limit_date price freight_value seller_zip_code_prefix seller_city seller_state payment_sequential payment_type payment_installments payment_value review_id review_score review_comment_title review_comment_message review_creation_date review_answer_timestamp customer_id order_status order_purchase_timestamp order_approved_at order_delivered_carrier_date order_delivered_customer_date order_estimated_delivery_date customer_unique_id customer_zip_code_prefix customer_city customer_state customer_ordered_datetime order_approved_time order_day order_month order_year order_time Order_month_year Product_volume id sigla nome nome_regiao geolocation_zip_code_prefix geolocation_lat geolocation_lng geolocation_city geolocation_state
0 1e9e8ef04dbcff4541ed26657ea517e5 perfumaria 40.000000 287.000000 1.000000 225.000000 16.000000 10.000000 14.000000 e17e4f88e31525f7deef66779844ddce 1 5670f4db5b62c43d542e1b2d56b0cf7c 2018-04-30 17:33:54 10.910000 7.390000 3694 sao paulo SP 1 debit_card 1 18.300000 d71da8fd8c6e3adef26be965f065b8a1 5 Vale a pena Super adooorei o delineador ele é bem preto e eu smp usei o da Mac e serio o da belle angel me surpreendeu pq eu achei que por ser barato iria ser ruim mas NAO é ... o unico problema é q ele é pequeno 2018-04-28 00:00:00 2018-05-04 00:55:02 f8a3e963a310aa58b60a5b1fed5bceb5 delivered 2018-04-24 16:16:53 2018-04-24 19:04:19 2018-04-25 17:01:00 2018-04-27 16:42:17 2018-05-09 00:00:00 b1a1199364a4a7fe27c4486ab63f550d 13848 mogi-guacu SP 2018-04-24 16:16:53 2018-04-24 19:04:19 1 April 2018 16:16:53 2018-4 2.240000 35 SP São Paulo Sudeste 13848.000000 -22.358101 -46.943397 mogi guacu SP
1 a035b83b3628decee6e3823924e0c10f perfumaria 53.000000 2235.000000 3.000000 1450.000000 20.000000 25.000000 20.000000 b18cb761efbe70da4838435a349abd07 1 5670f4db5b62c43d542e1b2d56b0cf7c 2018-07-06 19:31:03 268.380000 21.070000 3694 sao paulo SP 1 credit_card 10 289.450000 0f0f85749cb82321d902e390a9c33694 4 nan nan 2018-07-11 00:00:00 2018-07-15 21:32:04 9ff6d1a05ecfe2d907adac11a2e2bfe9 delivered 2018-07-04 18:39:28 2018-07-05 16:33:14 2018-07-05 13:21:00 2018-07-10 22:34:39 2018-07-27 00:00:00 fc68ba9c34778e17224154c255b5656e 84200 jaguariaiva PR 2018-07-04 18:39:28 2018-07-05 16:33:14 2 July 2018 18:39:28 2018-7 10.000000 41 PR Paraná Sul 84200.000000 -24.282197 -49.735125 jaguariaiva PR
2 091107484dd7172f5dcfed173e4a960e perfumaria 50.000000 260.000000 2.000000 183.000000 16.000000 8.000000 13.000000 a7708ffa8966514c098d15e1abfa6417 1 5670f4db5b62c43d542e1b2d56b0cf7c 2018-04-27 19:31:35 7.650000 10.550000 3694 sao paulo SP 1 credit_card 2 57.750000 39489f8518181b82cb86ce264e1ccc55 5 Agora sim Recebido o pedido finalmente produto ótimo 2018-04-26 00:00:00 2018-04-30 21:15:06 360782688ff472f1adbd47a85861751e delivered 2018-04-23 19:03:35 2018-04-24 19:17:45 2018-04-24 16:22:36 2018-04-25 23:02:39 2018-05-15 00:00:00 c67329fd07872a68d6d6d1f1626a0760 3804 sao paulo SP 2018-04-23 19:03:35 2018-04-24 19:17:45 0 April 2018 19:03:35 2018-4 1.664000 35 SP São Paulo Sudeste 3804.000000 -23.505101 -46.487342 sao paulo SP
3 e8b61f78db501ea0ca45677d1ca27de2 brinquedos 49.000000 120.000000 1.000000 400.000000 50.000000 10.000000 25.000000 a7708ffa8966514c098d15e1abfa6417 2 3d871de0142ce09b7081e2b9d1733cb1 2018-04-27 19:31:35 29.000000 10.550000 13232 campo limpo paulista SP 1 credit_card 2 57.750000 39489f8518181b82cb86ce264e1ccc55 5 Agora sim Recebido o pedido finalmente produto ótimo 2018-04-26 00:00:00 2018-04-30 21:15:06 360782688ff472f1adbd47a85861751e delivered 2018-04-23 19:03:35 2018-04-24 19:17:45 2018-04-24 16:22:36 2018-04-25 23:02:39 2018-05-15 00:00:00 c67329fd07872a68d6d6d1f1626a0760 3804 sao paulo SP 2018-04-23 19:03:35 2018-04-24 19:17:45 0 April 2018 19:03:35 2018-4 12.500000 35 SP São Paulo Sudeste 3804.000000 -23.505101 -46.487342 sao paulo SP
4 ccac9976bafbf7e587bd2c29302e2314 perfumaria 53.000000 2235.000000 3.000000 1500.000000 20.000000 25.000000 20.000000 206d1a13596872a713dba14504fdf699 1 5670f4db5b62c43d542e1b2d56b0cf7c 2018-08-01 16:35:17 268.380000 18.070000 3694 sao paulo SP 1 credit_card 8 286.450000 36b616f738d1f617cc12aeb2c01705f3 3 nan nan 2018-08-07 00:00:00 2018-08-07 22:56:38 58c5eb8e7844d1ecd85128e663f49e04 delivered 2018-07-29 20:50:22 2018-07-30 18:31:13 2018-08-01 15:24:00 2018-08-06 18:36:46 2018-08-15 00:00:00 1b6d2664b20a04a1fa9c900aedf29440 71920 brasilia DF 2018-07-29 20:50:22 2018-07-30 18:31:13 6 July 2018 20:50:22 2018-7 10.000000 53 DF Distrito Federal Centro-Oeste 71920.000000 -15.832732 -48.035078 brasilia DF
In [34]:
Total_rows,Total_col=df_orders_items.shape
Total_rows,Total_col
Out[34]:
(117329, 56)

Aggregating the data frame by month and region down to only the columns required for the plots that follow.

In [51]:
df_regions_group = df_orders_items.groupby(by=['Order_month_year', 'nome_regiao'], as_index=False)
df_regions_group = df_regions_group.agg({'customer_id': 'count', 'price': 'sum'}).sort_values(by='Order_month_year')
df_regions_group.columns = ['month', 'region', 'order_count', 'order_amount']
df_regions_group.reset_index(drop=True, inplace=True)
In [41]:
df_regions_group
Out[41]:
month region order_count order_amount
0 2016-10 Centro-Oeste 20 2435.85
1 2016-10 Nordeste 37 5796.65
2 2016-10 Norte 7 1127.30
3 2016-10 Sudeste 244 32298.08
4 2016-10 Sul 71 8614.59
... ... ... ... ...
104 2018-8 Norte 85 14748.32
105 2018-8 Sul 998 118584.98
106 2018-8 Centro-Oeste 388 48628.74
107 2018-8 Nordeste 488 73239.31
108 2018-9 Sudeste 1 145.00

109 rows × 4 columns

In [53]:
fig = px.scatter(df_regions_group, x="month", y="order_amount", size='order_amount',
                   symbol="region",
                 color="region",
                 title='<b>Orders across different Region of Brazil</b>',
              
                )
fig.update_layout(
title="Revenue generation across different Region of Brazil",
yaxis_title="Orders_Amount",
xaxis_title="Month of Year",
font=dict(family='Arial',
size=14,
)
)
fig.show()

The graph above shows how the amount of money customers pay for products varies across the different regions of Brazil. Each marker represents one geographical region, and a bigger marker means people of that region spend more at the stores.

Therefore the story depicted by the graph can be explained as follows:

  • Expenditure by people of the Sudeste region increased from a minimum in January 2017 to a maximum in January 2018, after which it saturated around July 2018 and stayed roughly constant.

  • People of the Sul region follow next: in January 2017 their expenditure at the Olist stores was minimal; it then increased to ~ BRL 70K, and the highest expenditure reached ~ BRL 150K in the later months covered by the data, after which it stayed roughly constant.

  • There is only a minimal difference in expenditure at the Olist stores between people of the Nordeste & Centro-Oeste regions.

  • The least expenditure comes from people of the Norte region of Brazil, with mean spending of ~ BRL 90K.

Q6. How do order counts differ between various Brazilian regions?¶

After keeping track of the amount of money people of different regions of Brazil spend, it is also necessary to know how the number of orders varies, i.e., how many orders people from different regions of Brazil placed at the Olist e-commerce stores per month of the year.

In [54]:
fig, ax = plt.subplots(figsize=(18, 8)) # Putting the dimension x-axis as 18 and y-axis as 8

sns.lineplot(x='month', y='order_count',data=df_regions_group, hue='region', 
             size='region', style='region', palette='magma', markers=['o'] * 5
             )
sns.set_style('darkgrid')
plt.xlabel("Month of Year" , size = 12)
plt.ylabel("No. of orders" , size = 12)
plt.title("Number of orders across Brazil's various regions" , size = 24)

The graph above depicts the number of orders coming from different regions of Brazil across the months of the year. The following points can be concluded:

  • The highest number of orders comes from the Sudeste region, with a count of 65K.
  • People of the Sul region placed the 2nd highest number of orders at the Olist stores.
  • For the Centro-Oeste, Nordeste, and Norte regions, the variation in the number of orders shows minimal differences, and the corresponding lines stay nearly constant.
  • Oct-Nov 2017 registered the highest number of orders across the different geographical regions of Brazil.

Q7. How does the number of customers vary across different cities of Brazil for the years 2016, 2017, and 2018?¶

To visualize the number of consumers scattered across the various cities of Brazil, 30K rows have been sampled from the dataset for a quick estimate. We first filter order_year for 2016, 2017, and 2018 and get the corresponding latitude and longitude from the data frame.

In [54]:
lats = list(df_orders_items.query('order_year == 2017')['geolocation_lat'].dropna().values)[:30000]
longs = list(df_orders_items.query('order_year == 2017')['geolocation_lng'].dropna().values)[:30000]
locations = list(zip(lats, longs))


loc = 'No. of customers in Year 2017 from different cities of Brazil'
title_html = '''
             <h3 align="center" style="font-size:16px"><b>{}</b></h3>
             '''.format(loc)  
# Creating a mapa using folium
map1 = folium.Map(location=[-15, -50], zoom_start=4.0)

# Plugin: FastMarkerCluster
FastMarkerCluster(data=locations).add_to(map1)
folium.TileLayer('Stamen Terrain').add_to(map1)

map1.get_root().html.add_child(folium.Element(title_html))

map1.save('map-with-title.html')
map1
Out[54]:
Make this Notebook Trusted to load map: File -> Trust Notebook
In [55]:
lats = list(df_orders_items.query('order_year == 2016')['geolocation_lat'].dropna().values)[:30000]
longs = list(df_orders_items.query('order_year == 2016')['geolocation_lng'].dropna().values)[:30000]
locations = list(zip(lats, longs))

# Adding title to graphs
loc = 'No. of customers in Year 2016 from different cities of Brazil'
title_html = '''
             <h3 align="center" style="font-size:16px"><b>{}</b></h3>
             '''.format(loc)  


# Creating a mapa using folium
map2 = folium.Map(location=[-15, -50], zoom_start=4.0)

# Plugin: FastMarkerCluster
FastMarkerCluster(data=locations).add_to(map2)
folium.TileLayer('Stamen Terrain').add_to(map2)



map2.get_root().html.add_child(folium.Element(title_html))

map2.save('map-order_year == 2016.html')

map2
Out[55]:
Make this Notebook Trusted to load map: File -> Trust Notebook
In [56]:
lats = list(df_orders_items.query('order_year == 2018')['geolocation_lat'].dropna().values)[:30000]
longs = list(df_orders_items.query('order_year == 2018')['geolocation_lng'].dropna().values)[:30000]
locations = list(zip(lats, longs))



loc = 'No. of customers in Year 2018 from different cities of Brazil'
title_html = '''
             <h3 align="center" style="font-size:16px"><b>{}</b></h3>
             '''.format(loc)  



# Creating a mapa using folium
map3 = folium.Map(location=[-15, -50], zoom_start=4.0)

# Plugin: FastMarkerCluster
FastMarkerCluster(data=locations).add_to(map3)
folium.TileLayer('Stamen Terrain').add_to(map3)

map3.get_root().html.add_child(folium.Element(title_html))

map3.save('map-withorder_year == 2018.html')



map3
Out[56]:
Make this Notebook Trusted to load map: File -> Trust Notebook

Finally, a heatmap gives a density view of where orders come from across Brazil:

In [58]:
%%time
# Grouping geolocation data for plotting a heatmap
heat_data = df_orders_items.groupby(by=['geolocation_lat', 'geolocation_lng'], as_index=False).count().iloc[:, :3]

# Creating a mapa using folium
map1 = folium.Map(
    location=[-15, -50], 
    zoom_start=4.0, 
    tiles='cartodbdark_matter'
)

# Plugin: HeatMap
HeatMap(
    name='Mapa de Calor',
    data=heat_data,
    radius=10,
    max_zoom=13
).add_to(map1)
folium.TileLayer('Stamen Terrain').add_to(map1)
map1.save('map-withorder_geolocation.html')

map1
CPU times: user 467 ms, sys: 5.97 ms, total: 473 ms
Wall time: 471 ms
Out[58]:
Make this Notebook Trusted to load map: File -> Trust Notebook

Q8. In which parts of Brazil do the maximum number of orders end up undelivered, canceled, & shipped?¶

In [59]:
df_copy_order_canceled=df_orders_items[df_orders_items['order_status']=="canceled"]
lats = list(df_copy_order_canceled['geolocation_lat'].dropna().values)[:30000]
longs = list(df_copy_order_canceled['geolocation_lng'].dropna().values)[:30000]
locations = list(zip(lats, longs))


loc = 'Regions of Brazil where orders are getting canceled'
title_html = '''
             <h3 align="center" style="font-size:16px"><b>{}</b></h3>
             '''.format(loc)  


# Creating a mapa using folium
map1 = folium.Map(location=[-15, -50], zoom_start=4.0)
# Plugin: FastMarkerCluster
FastMarkerCluster(data=locations).add_to(map1)
folium.TileLayer('Stamen Terrain').add_to(map1)


map1.get_root().html.add_child(folium.Element(title_html))

map1.save('map-with-order_status_canceled.html')


map1
Out[59]:
Make this Notebook Trusted to load map: File -> Trust Notebook
In [60]:
df_copy_order_canceled.nome_regiao.value_counts()
Out[60]:
Sudeste         431
Sul              62
Nordeste         34
Centro-Oeste     22
Norte             4
Name: nome_regiao, dtype: int64

The conclusions that can be drawn from the plots above (for a sample of 30 K rows) are:

  • The Sudeste region has the maximum number of canceled orders, with a count of 431.
  • The Sul region has 62 canceled orders.
  • The Norte region has the fewest canceled orders, counting only 4.
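Raw counts can be misleading here, since regions differ widely in total order volume. A minimal sketch of computing a per-region cancellation *rate* instead, using a tiny synthetic stand-in for df_orders_items (the real frame has the same nome_regiao and order_status columns, plus many more):

```python
import pandas as pd

# Tiny synthetic stand-in for df_orders_items
df = pd.DataFrame({
    'nome_regiao':  ['Sudeste', 'Sudeste', 'Sudeste', 'Sul', 'Sul', 'Norte'],
    'order_status': ['canceled', 'delivered', 'delivered',
                     'delivered', 'canceled', 'delivered'],
})

# Cancellation rate per region = canceled orders / all orders in that region
cancel_rate = (
    df.assign(is_canceled=df['order_status'].eq('canceled'))
      .groupby('nome_regiao')['is_canceled']
      .mean()
      .sort_values(ascending=False)
)
print(cancel_rate)
```

Run against the full dataframe, this would show whether Sudeste's high cancellation count simply tracks its high overall order volume.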
In [61]:
df_copy_order_delivered=df_orders_items[df_orders_items['order_status']=="delivered"]
lats = list(df_copy_order_delivered['geolocation_lat'].dropna().values)[:30000]
longs = list(df_copy_order_delivered['geolocation_lng'].dropna().values)[:30000]
locations = list(zip(lats, longs))
# Creating a map using folium

loc = 'Regions of Brazil where orders are getting delivered'
title_html = '''
             <h3 align="center" style="font-size:16px"><b>{}</b></h3>
             '''.format(loc)  



map2 = folium.Map(location=[-15, -50], zoom_start=4.0)

# Plugin: FastMarkerCluster
FastMarkerCluster(data=locations).add_to(map2)
folium.TileLayer('Stamen Terrain').add_to(map2)


map2.get_root().html.add_child(folium.Element(title_html))

map2.save('map-with-order_status_delivered.html')

map2
Out[61]:
Make this Notebook Trusted to load map: File -> Trust Notebook
In [83]:
df_copy_order_shipped.nome_regiao.value_counts()
Out[83]:
Sudeste         725
Nordeste        196
Centro-Oeste    112
Sul             104
Norte            30
Name: nome_regiao, dtype: int64

Products that remain in the shipped state, possibly because of logistics constraints, are shown on the map above:

  • The Sudeste region of Brazil has the maximum number of shipped orders.
  • The Nordeste region has the second-highest number of shipped orders.
  • The Norte region has the fewest orders still at the shipped stage.
In [62]:
df_copy_order_shipped=df_orders_items[df_orders_items['order_status']=="shipped"]
lats = list(df_copy_order_shipped['geolocation_lat'].dropna().values)[:30000]
longs = list(df_copy_order_shipped['geolocation_lng'].dropna().values)[:30000]
locations = list(zip(lats, longs))



loc = 'Regions of Brazil where orders are getting shipped'
title_html = '''
             <h3 align="center" style="font-size:16px"><b>{}</b></h3>
             '''.format(loc)  


# Creating a map using folium
map3 = folium.Map(location=[-15, -50], zoom_start=4.0)

# Plugin: FastMarkerCluster
FastMarkerCluster(data=locations).add_to(map3)
folium.TileLayer('Stamen Terrain').add_to(map3)


map3.get_root().html.add_child(folium.Element(title_html))

map3.save('map-with-order_status_shipped.html')

map3
Out[62]:
Make this Notebook Trusted to load map: File -> Trust Notebook
In [81]:
df_copy_order_delivered.nome_regiao.value_counts()
Out[81]:
Sudeste         78993
Sul             16484
Nordeste        10596
Centro-Oeste     6706
Norte            2080
Name: nome_regiao, dtype: int64

The map above shows the locations in Brazil where orders are delivered (for a sample of 30 K rows):

  • The Sudeste region of Brazil has the maximum number of delivered products in the sample.
  • The Sul region has the second-highest number of delivered products in the sample.
  • The Norte region has the fewest delivered products in the chosen sample.
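Rather than calling value_counts() separately for each order status, the three regional breakdowns can be viewed side by side in a single table with pd.crosstab. A sketch on a small synthetic stand-in for df_orders_items:

```python
import pandas as pd

# Synthetic stand-in rows; the notebook's df_orders_items carries the
# same nome_regiao and order_status columns.
df = pd.DataFrame({
    'nome_regiao':  ['Sudeste', 'Sudeste', 'Sul', 'Sul', 'Norte', 'Sudeste'],
    'order_status': ['delivered', 'canceled', 'delivered',
                     'shipped', 'delivered', 'shipped'],
})

# One region-by-status table instead of three separate value_counts() calls
status_by_region = pd.crosstab(df['nome_regiao'], df['order_status'])
print(status_by_region)
```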

Q9. In which region do people mostly use a credit card as their payment method?¶

As economies vary from state to state, it is of interest to know which states' customers seek credit-based financial support for their purchases, as well as their capability to repay the amount quickly.

The answer is obtained by combining the payment_type column of the dataframe df_orders_items with the customers' location information.

Reviewing the statistics of the payment_type column makes it clear that, among the customers visiting Olist stores, most prefer to pay their bills via credit card over the other payment methods.
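One way to quantify this per region is to normalise a crosstab of region against payment type, giving the credit-card share within each region rather than raw counts. A minimal sketch on synthetic stand-in data (the real notebook would use df_orders_items):

```python
import pandas as pd

# Synthetic stand-in for the two columns used in Q9
df = pd.DataFrame({
    'nome_regiao':  ['Sudeste', 'Sudeste', 'Sudeste', 'Sul', 'Sul', 'Norte'],
    'payment_type': ['credit_card', 'credit_card', 'boleto',
                     'credit_card', 'boleto', 'boleto'],
})

# normalize='index' turns each row into shares, so the table reads as
# "fraction of that region's payments made with each payment type"
share = pd.crosstab(df['nome_regiao'], df['payment_type'], normalize='index')
print(share['credit_card'].sort_values(ascending=False))
```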

In [63]:
df_credit=df_orders_items[df_orders_items['payment_type']=="credit_card"]
lats = list(df_credit['geolocation_lat'].dropna().values)[:30000]
longs = list(df_credit['geolocation_lng'].dropna().values)[:30000]
locations = list(zip(lats, longs))


loc = 'Regions of Brazil with the maximum number of credit card users'
title_html = '''
             <h3 align="center" style="font-size:16px"><b>{}</b></h3>
             '''.format(loc)  
# Creating a map using folium
map2 = folium.Map(location=[-15, -50], zoom_start=4.0)

# Plugin: FastMarkerCluster
FastMarkerCluster(data=locations).add_to(map2)
folium.TileLayer('Stamen Terrain').add_to(map2)

map2.get_root().html.add_child(folium.Element(title_html))

map2.save('map-with-payment_type_credit_card.html')

map2
Out[63]:
Make this Notebook Trusted to load map: File -> Trust Notebook
In [85]:
df_credit.nome_regiao.value_counts()
Out[85]:
Sudeste         59848
Sul             11930
Nordeste         8173
Centro-Oeste     5042
Norte            1527
Name: nome_regiao, dtype: int64
  • The Sudeste region of Brazil has the maximum number of customers who pay their bills at Olist stores via credit card.
  • The Sul region has the second-highest number of customers paying via credit card.
  • The Norte region has the fewest customers paying via credit card.

Conclusion:¶

After downloading the dataset from Kaggle, the individual CSV files were merged into a final dataframe by following the data schema, and cleaned. Open-ended analysis was then performed on individual columns of the dataframe, as well as on combinations of two or more columns, to draw better insights from the available dataset.

  • The majority (~76.3%) of customers visiting the Olist e-commerce stores pay their bills via credit card; most of the remaining customers pay via boleto, and only 1.44% pay via debit card.
  • Customers who pay via credit card typically repay the amount in 2-3 installments, which indicates a healthy financial condition and can point to potential repeat buyers at Olist stores.
  • ~56% of customers rated their purchases 5 stars, which reflects the good products and services Olist stores offer their customers.
  • 97% of ordered items are delivered to customers, whereas 3% lag and are never delivered; this gap needs investigation and resolution.
  • The cama_mesa_banho category has the most products available at Olist stores, followed by other categories such as eletronicos, which accounts for far fewer items within the top 15 product categories.
  • The cama_mesa_banho category is purchased mostly via credit card.
  • Among all customers purchasing from Olist stores, people of the Sudeste region use credit cards the most, whereas people of the Norte region use them the least.
  • Among Brazilian cities, Sao Paulo has the maximum number of sellers, counting about 28 K, while the other cities have far fewer.
  • Within the top 20 product categories purchased via credit_card, cama_mesa_banho has the highest count, while fashion_bolsas_e_acessorios has the lowest, meaning fewer people purchase it from Olist stores.
  • The cama_mesa_banho, utilidades_domesticas, and moveis_decoracao categories have high counts, which implies heavier logistics requirements for their sellers.
  • Customers spent the most on purchases in September 2017; the months of 2016 and 2018 saw lower spending, with December 2016 the lowest. September 2017 was therefore the month of maximum revenue generation for the stores.
  • However, October-November 2017 saw the maximum number of orders, i.e. the largest number of products purchased at Olist stores by customers.
  • The maximum number of customers visiting Olist stores belong to the Sudeste region of Brazil, very few belong to the Norte region, and the other regions have roughly equal numbers of customers.
  • Among the regions of Brazil, the Sudeste region purchases the largest number of products from Olist stores, followed by the Sul region; the Centro-Oeste region purchases fewer products, and the Norte region a negligible number. This parallels the observation above that fewer customers come from the Norte region, so fewer purchases follow.
  • The Sudeste region has the maximum number of canceled orders, and the same region also has the maximum number of shipped orders.
  • The Sudeste region likewise has the maximum number of delivered orders, which is consistent: the region with the most orders overall leads in delivered, shipped, and canceled counts alike.

Future work:¶

Apart from the open-ended questions addressed by exploring individual columns and by combining different columns, it would be worthwhile to extract finer details, such as:

  • Order counts at different times of the week, to estimate the most favourable purchase times; this can guide keeping everything required ready during those periods.
  • The total number of customers in each region of Brazil, their order counts, and their expenditures, which can help analyse revenue generation.
  • A year-wise comparison of the number of products sold in each category per region, to track the growth of the stores' revenue and sales.
  • Strategies to reduce the number of canceled orders and long delivery times for each product category, along with the reasons for cancellation in particular regions.
  • The average amount of time users spend in the stores, which can be used to better tailor the other goods and services offered to customers.
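The first item above can be sketched directly with pandas: extract the weekday name from the purchase timestamp and count orders per day. A minimal illustration on a few synthetic timestamps (the real column in the Olist data is order_purchase_timestamp):

```python
import pandas as pd

# Synthetic stand-in purchase timestamps
ts = pd.Series(pd.to_datetime([
    '2017-09-04', '2017-09-05', '2017-09-11',  # two Mondays, one Tuesday
    '2017-09-09',                              # a Saturday
]))

# Orders per weekday name, most frequent first
by_day = ts.dt.day_name().value_counts()
print(by_day)
```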
In [35]:
%%time
jovian.commit()
[jovian] Updating notebook "jhagautamkumar362/eda-project-working" on https://jovian.ai
[jovian] Committed successfully! https://jovian.ai/jhagautamkumar362/eda-project-working
CPU times: user 81.5 ms, sys: 33.5 ms, total: 115 ms
Wall time: 2min 24s
Out[35]:
'https://jovian.ai/jhagautamkumar362/eda-project-working'

References¶

  • https://seaborn.pydata.org/generated/seaborn.set_style.html
  • https://datascientyst.com/convert-datetime-day-of-week-name-number-in-pandas/
  • https://www.kaggle.com/datasets/olistbr/brazilian-ecommerce
  • https://www.geeksforgeeks.org/seaborn-distribution-plots/
  • https://plotly.com/python/line-charts/